Abstract

Background: Along with the rapid digitalization of health data (e.g. Electronic Health Records), there is an increasing 
concern about maintaining data privacy while garnering the benefits, especially when the data are required to be 
published for secondary use. Most of the current research on protecting health data privacy is centered around data 
de-identification and data anonymization, which removes the identifiable information from the published health data 
to prevent an adversary from reasoning about the privacy of the patients. However, published health data is not the 
only source that the adversaries can count on: with a large amount of information that people voluntarily share on 
the Web, sophisticated attacks that join disparate information pieces from multiple sources against health data 
privacy become practical. So far, limited effort has been devoted to studying these attacks. 

Results: We study how patient privacy could be compromised with the help of today's information technologies. 
In particular, we show that private healthcare information could be collected by aggregating and associating 
disparate pieces of information from multiple online data sources including online social networks, public records 
and search engine results. We present a real-world case study showing that user identity and privacy are highly 
vulnerable to attribution, inference and aggregation attacks. Using real data, we also show that people remain highly 
identifiable to adversaries even when the adversaries hold inaccurate information about the target. 

Conclusion: We claim that so much information has been made available electronically and online that 
people are very vulnerable without effective privacy protection. 



Background 

In recent years, a large amount of health data has been 
digitalized to reduce the cost and improve health care 
quality and efficiency. In various care delivering settings, 
patient health information such as demographics, pro- 
blems, medications, progress notes, laboratory data, and 
medical history are recorded in the format of Electronic 
Health Records (EHRs). By 2009, 43.9% of U.S. medical 
offices had adopted full or partial EHR systems [1]. 
The availability and legibility of electronic records not 
only facilitate sharing information in care-related activities, 
but also reduce medical errors and service time. With the 
broad adoption of EHRs, security and privacy of the 
digitalized health data become extremely critical, considering 
that medical data are highly sensitive for patients. 

What further complicates the issue is that with the large 
amount of health data being digitalized, there is always a 
demand to publish the data for more intelligent use. 
Immense volumes of EHRs are published every year for 
secondary use, such as medical research, public health, 
government management, and other healthcare related 
services [2]. A typical EHR consists of a set of identifier 
attributes (e.g. name, SSN), quasi-identifier attributes (e.g. 
gender, zipcode), and sensitive attributes (e.g. diseases). To 
protect the privacy of record owners, EHRs need to be de- 
identified [3-6] or anonymized [7-10] before publishing. 

The key to the adoption of EHR and other health infor- 
mation technologies is the security and privacy of the 
digitalized, highly sensitive medical data. Hence, security 
and privacy have become an important and popular topic in 
healthcare informatics research. Current research on protecting 
patient privacy in healthcare information systems 
is centered on the protection of EHRs - that is, to 
protect patient information from being abused by author- 
ized users, or being accessed by unauthorized outsiders, 
or being re-identified from health data published for sec- 
ondary use. Regulations on data disclosure are set and 
legislated to address the threat from insider malfeasance 
and enhance the protection of health related data [11]. In 
the meantime, information technologies such as access 
control, encryption, file integrity check, firewalls, and 
anti-virus mechanisms defend against both internal abuse 
of and unauthorized access to the electronic health data. 
Various data de-identification [3-6] and data anonymiza- 
tion [7-10] techniques have also been proposed to sani- 
tize the data sets in publishing and ensure that the data 
does not disclose any information about the sensitive 
data. 

However, as the Web gains its popularity and touches 
many aspects of our daily life, it becomes the largest open- 
access source of personal information. The adversaries 
possess powerful weapons and rich knowledge, which are 
somehow provided by the victims themselves and are truly 
beyond the assumptions in the research literature. 

First, a large number of public records have been made 
accessible online, including phone books, voter registration, 
birth/death records, etc. Although some of them 
enforce certain restrictions to defend against abusers, it is 
still relatively easy or inexpensive to crawl or download such 
records. More recently, online social network sites such as 
Facebook and MySpace have emerged to successfully 
attract a huge number of users, who willingly post their 
personal information on online social network sites to 
share with others. Second, with the new sophistication of 
information retrieval techniques and the advances of 
searching techniques in search engines, it becomes unex- 
pectedly easy to conduct Web-scale extraction of users' 
personal information that is readily available in various 
online social networks (e.g., [12-16]). As a result, malicious 
or curious adversaries could easily take advantage of these 
techniques to collect others' private information, which is 
readily available from online public records or various 
social networks. 

Therefore, it is reasonable for us to raise the question: 
"When an attacker possesses a small amount of (possibly 
inaccurate) information from healthcare-related sources 
and associates such information with publicly accessible 
information from online sources, how likely is it that the 
attacker will be able to discover the identity of the targeted 
patient, and what are the potential privacy risks?" 

To take a first step in answering this broad question, 
we study: (1) how user information from multiple online 
sources could be associated and utilized to compromise 
user privacy; (2) how user identity could be identified by 
comparing approximate information with public 
databases. 



Results and discussion 

Attacks on healthcare records 

One effective protection for published EHRs is data 
de-identification and anonymization. However, even with 
the sanitized data, sensitive attributes that pertain to an 
individual may be learned from other non-sensitive attri- 
butes in combination with external knowledge (e.g. voter 
registration list, phone books, etc.). The risks of such re- 
identification attacks have been intensively studied, 
which shows that the amount and types of an attacker's 
external knowledge play an important role in reasoning 
about privacy in data publishing [9,10,17,18]. However, it 
is difficult, if not impossible, for a data publisher to 
know upfront what external knowledge the attacker pos- 
sesses. Therefore, current research on privacy-preserving 
data publishing studies the problem from a theoretical 
perspective by making assumptions on attacker's back- 
ground knowledge, quantifying external knowledge 
regardless of its content, and sanitizing the data to ensure 
the amount of disclosure is below a specified threshold 
[17,18]. As a result, such protection, on the one hand, does 
not take into account the large amount of external 
knowledge that is accessible to adversaries from various 
online sources (e.g. social networks); on the other hand, 
it might greatly distort the data and impair its secondary uses. 
Therefore, we believe it is of great importance to investi- 
gate the types and amounts of external knowledge that a 
powerful attacker possesses or infers from the immense 
volume of electronic data from multiple online resources. 
Such an investigation not only provides evidence for efficient 
and optimal data sanitization, but also raises public concern 
about and awareness of the severity of privacy threats, and 
calls for effective protection. 

Another potential privacy attack relates to the issue of 
insider misfeasance of sensitive medical data. Health care 
delivery personnel may violate privacy rules by disclosing 
or stealing private healthcare records for unauthorized 
usages, as depicted in [19]. This is a typical abuse/infrac- 
tion with authorized data access. More often, the attack- 
ers do not have authorization for data access. They either 
eavesdrop or wiretap private information in transit or 
penetrate into EHR systems to get control of valuable 
health data. However, such types of attacks are often 
underestimated [20]. We believe such underestimation is 
partially from a fundamental misunderstanding that 
information revealed by carelessness or misuse is only 
one piece of the big picture and will not cause severe 
privacy disclosure. In this article, we illustrate the 
severity of this type of attack in the current information-rich 
context with an intuitive example. 

Attacks from external sources 

Online social networks (OSNs) have become extremely 
popular in recent years. Users of OSNs often voluntarily 
disclose their personal information in surprising 
detail. For example, LinkedIn users list their educational 
and work experience to seek potential 
career opportunities, and MedHelp users share details 
about their life and medical experiences expecting to 
receive pertinent medical suggestions from others. 
While releasing privacy-related information online, a 
fundamental misunderstanding of these information 
owners is that it is unlikely to link information pertain- 
ing to the same individual from different online sources. 
Unfortunately, with the advances of searching and infor- 
mation retrieval techniques, it is feasible for an attacker 
to aggregate personal information of a targeted user 
from different online sources, by associating unprotected 
but identifiable or semi-identifiable attributes (e.g. iden- 
tical account names or email address of a careless user) 
[21]. Meanwhile, with governmental and industrial 
efforts, a large number of public records have been digitalized 
and made available online. Most of them are 
indexed by commercial search engines with free access, 
while the others only require a minimum subscription 
fee to obtain full access. Adversaries could easily access 
and utilize such information to compromise others' 
privacy, especially, their highly sensitive healthcare data. 

From our real-world case study, we find it is highly pos- 
sible for an attacker to aggregate disparate pieces of infor- 
mation from multiple (possibly medical-related) online 
sources, and associate the attributes to identify a targeted 
patient with high confidence. 

Real-world case study 

Figure 1 demonstrates an example from a real-world case 
study with a simulated attacker: "Jean" (whose full name 
has been discovered but removed here for privacy protec- 
tion) has type II diabetes, so she actively participates in 
two online medical social network sites, MedHelp 
(http://www.medhelp.org) and MP and Thl Discussion 
Forum (http://www.curemythl.org). Assume these are 
the only two "trivial" facts that the attacker knows about 
Jean: Jean has type II diabetes, and Jean has profiles in both 
OSNs. 

Since the user profiles in both OSNs are publicly acces- 
sible (after registration), the attacker's first move is to 
crawl them from both sites, which is not an impractical 
task even for attackers with limited resources and computing 
power. Next, the attacker analyzes the crawled 
profiles to associate the profiles with matching attributes. 
The association should be conducted under two assump- 
tions: (1) an individual is identifiable by an attribute set, 
and (2) the values of attributes in such attribute set are 
authentic information pertaining to an individual. 

The second assumption is valid considering the nature 
of online medical social networks. The primary goal of 
online medical social networks is to serve as platforms 
for doctors and patients to discuss symptoms, compare 
treatment options and exchange medical advances. 
Therefore, in order to receive unbiased medical advice, 
the users often provide real and accurate personal infor- 
mation for most of the fields in user profiles, although 
they do use pseudonyms to register with the OSNs for 
privacy protection. Hence, it is reasonable to 
believe the values of non-identifiable attributes in user 
profiles are real. 

To validate the first assumption, we look at the fields of 
the user profiles of both medical social network sites: 

♦ MedHelp user profile includes user name, gender, 
age, location, time joined, interest, and a text field that 
allows more detailed inputs about the user and his/her 
particular medical conditions and problems. 

♦ MP and Thl user profile contains three types of 
fields: (1) fields directly related to a user's activities on 
this site (e.g. user name, joined time); (2) fields about his/her 
other online contacts (e.g. email, homepage, ICQ), 
and (3) fields about his/her personal information (e.g. 
birthdate, occupation, location, and a text field for speci- 
fying personal interests on medical information). 

Both profiles contain the attribute set {Gender, Loca- 
tion, Age}. According to a famous study [22] of 1990 US 
census data, 87% of the U.S. population is uniquely iden- 
tifiable by the attribute set {Gender, ZIP code, Birthdate}, 
and 53% of the U.S. population is uniquely identified by 
the attribute set {Gender, Location, Birthdate}, where the 
"location" refers to the city, town, or municipality in 
which the person resides. Since the two social networks 
target people with particular interests, it is reasonable to 
consider {Gender, Location, Age} as a quasi-identifier in 
this case. 
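
To make the notion of a quasi-identifier concrete, the following minimal Python sketch measures what fraction of records is singled out by a given attribute set. The records, attribute names and the helper `unique_fraction` are hypothetical illustrations and are not data or code from the study.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Fraction of records whose values on `attributes` occur exactly once."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Synthetic toy records (hypothetical; not taken from any real data set).
people = [
    {"gender": "F", "location": "Townville", "age": 52},
    {"gender": "F", "location": "Townville", "age": 52},
    {"gender": "M", "location": "Townville", "age": 52},
    {"gender": "F", "location": "Cityburg", "age": 31},
    {"gender": "M", "location": "Cityburg", "age": 31},
]

gender_only = unique_fraction(people, ["gender"])
full_set = unique_fraction(people, ["gender", "location", "age"])
```

In this toy table no record is unique on gender alone, while three of the five records are unique on the combined attribute set, which is the effect the census figures above describe at population scale.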

From the crawled data, the attacker finds two "link- 
able" profiles, as shown in Figure 1(1) and Figure 1(2): 

1. Profile 1 shows "my husband" that indicates the 
owner is a female, which is consistent with the gender 
shown in Profile 2; 

2. The locations in both profiles are the same small 
town with approximately 15K population; 

3. The birthdate in Profile 1 is consistent with the age 
shown in Profile 2. 

In addition, the pseudonymous user names in both profiles are 
identical (and relatively unique), and both profiles show 
interest in the same disease and symptoms - type II 
diabetes. It is reasonable to link the two profiles at a 
certain confidence level and associate the attributes from 
both profiles to the same individual in the real-world - 
Jean. After that, more private attributes of Jean (e.g. times 
of doctor visits, diagnoses, prescriptions and medicines) are 
extracted from her postings on the two medical social net- 
work sites and added to the reconstructed profile of Jean. 
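
The consistency checks used to link the two profiles can be sketched as follows. This is an illustrative sketch only: the field names, the one-year tolerance on age, the reference date and the example values are all assumptions made here, not details of the actual case study.

```python
from datetime import date


def age_on(birthdate, today):
    """Age in whole years on the date `today`."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)


def link_evidence(p1, p2, today):
    """Return the attributes on which two profiles agree.

    p1 carries a birthdate, p2 carries a stated age (hypothetical schema).
    """
    evidence = []
    if p1["gender"] == p2["gender"]:
        evidence.append("gender")
    if p1["location"].lower() == p2["location"].lower():
        evidence.append("location")
    # Assumption: a stated age may be stale by a year, so allow +/- 1.
    if abs(age_on(p1["birthdate"], today) - p2["age"]) <= 1:
        evidence.append("age")
    if p1["username"].lower() == p2["username"].lower():
        evidence.append("username")
    return evidence


# Synthetic profiles (hypothetical values).
profile_1 = {"username": "sunny_jd52", "gender": "F",
             "location": "Smalltown", "birthdate": date(1958, 3, 14)}
profile_2 = {"username": "Sunny_JD52", "gender": "F",
             "location": "smalltown", "age": 52}
profile_x = {"username": "runner88", "gender": "M",
             "location": "Smalltown", "age": 30}

crawl_date = date(2010, 6, 1)
match = link_evidence(profile_1, profile_2, crawl_date)
non_match = link_evidence(profile_1, profile_x, crawl_date)
```

The more attributes agree, the higher the confidence of the link; how to weight each attribute is left open here, as it is in the text.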

From this baseline, we continue to further explore 
Jean's private information from the Web. 
Figure 1 A real-world example of cross-site information aggregation. The target patient "Jean" has profiles on two online medical social 
networking sites (1) and (2). By comparing the attributes from both profiles, the adversary can link the two with high confidence. Furthermore, 
the attacker can use the attribute values to get more profiles of the target through searching the Web (3) and other online public data sets (4 
and 5). By aggregating and associating the five profiles, Jean's full name, date of birth, husband's name, home address, home phone number, 
cell phone number, two email addresses, occupation, and medical information including lab test results are unintentionally disclosed. 



♦ Explore by Email: Email addresses can effectively 
serve as unique identifiers. They generally provide a helpful 
hint for linking two profiles. In this example, with the 
email address provided in Profile 1, we retrieved Profile 
4 (as shown in Figure 1(4)) through Web search 
engines. Profile 4 includes a phone number (which is 
found to be a cell phone number in later analysis) and a 
P.O. Box address. Both the phone number and the 
address indicate the same city as shown in Profiles 1 
and 2. 

♦ Explore by Phone Number: Phone numbers are 
commonly considered as unique identifiers. With the 
phone number from Profile 4, we further discovered 
Profile 3 (as shown in Figure 1(3)), which is a job- 
related page containing Jean's cell and home phone 
numbers. Profiles 3 and 4 both contain the full name of 
"Jean", and also provide a good hint on her occupation. 
Finally, with the home phone number, we also located 
Jean's record in the residential phone book, which 
shows her husband's name and their full home address 
(as shown in Figure 1(5)). 

In this simulated attack, we successfully acquired 
five profiles, which very likely belong to Jean, from 
both online social networking sites and publicly accessible 
online data. By associating the five profiles, we 
recovered many types of private information about Jean, 
including her full name, date of birth, husband's name, 
home address, home phone number, cell phone number, 
two email addresses, occupation and medical information 
including lab test results. 

On the other hand, even without Profiles 3, 4 and 5, 
an attacker could also utilize public records to get more 
information about Jean: with the attribute set {gender, 
birthdate, location}, Jean's identity (e.g. full name, 
address, and phone number) is recoverable from public 
birth records, voter registration records or online 
phone books. With her full name, more information 
about Jean is subsequently discovered from various 
social networks. Finally, when Jean's hospital publishes 
de-identified patient records to support medical 
research, the attacker with external knowledge obtained 
from above process is highly likely to re-identify Jean's 
record. 

The example reveals a serious privacy issue in both 
social networks and healthcare informatics. The entire 
process includes three steps: attribution, inference, and 
aggregation attacks. In attribution, identifiable, semi- 
identifiable or sensitive attributes are learned/extracted 
from various sources over the web. In particular, three 
types of online resources are considered in the example: 
(1) publicly accessible online databases: voter registration 
records, phone books, birth and death records; (2) online 
social network sites with explicit identifiable attributes 
(e.g. LinkedIn, Facebook, etc.) as well as specialized 
healthcare-related social networks (e.g. MedHelp); and (3) 
commercial search engines, which index a good portion of 
the web. In inference, more attributes are further discov- 
ered from social activities and relationships through 
statistical learning or logical reasoning. In aggregation, 
records retrieved from different sources that potentially 
pertain to the same individual are linked under strong or 
weak evidence: strong evidence includes 
matching identifiers or quasi-identifiers, and weak evidence 
consists of similarities identified from a statistical 
perspective. As we have shown in the example, the 
attacks are practical and do not require excessive 
resources or techniques. Therefore, people are very vulnerable 
to such attacks if they do not carefully protect 
their online identities. A powerful privacy protection 
tool is needed to defend against such attacks. 
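
The aggregation step under strong evidence can be pictured as transitive grouping: two records fall into the same group whenever they share an identifier value, directly or through a chain of other records. The sketch below uses a union-find structure for this. The choice of union-find, the schema, and all identifier values are assumptions made for illustration; the text does not prescribe a particular algorithm.

```python
def aggregate(profiles, identifier_keys):
    """Group profile ids that share any identifier value, transitively."""
    parent = {pid: pid for pid in profiles}

    def find(x):
        # Follow parent links to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # (key, value) -> first profile id carrying that value
    for pid, attrs in profiles.items():
        for key in identifier_keys:
            for value in attrs.get(key, []):
                owner = first_seen.setdefault((key, value), pid)
                parent[find(pid)] = find(owner)  # merge the two groups

    groups = {}
    for pid in profiles:
        groups.setdefault(find(pid), set()).add(pid)
    return sorted(groups.values(), key=min)


# Synthetic records loosely mirroring the five-profile example (made-up values).
profiles = {
    "P1": {"username": ["jd_example"], "email": ["jd@example.org"]},
    "P2": {"username": ["jd_example"]},
    "P3": {"phone": ["555-0101", "555-0102"]},
    "P4": {"email": ["jd@example.org"], "phone": ["555-0101"]},
    "P5": {"phone": ["555-0102"]},
    "P6": {"username": ["someone_else"], "email": ["other@example.org"]},
}

groups = aggregate(profiles, ["username", "email", "phone"])
```

Here P1 through P5 end up in one group even though P2 and P5 share no identifier directly, while the unrelated P6 stays alone. Weak, statistical evidence would require a similarity score rather than exact matches and is not covered by this sketch.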

Attacks with approximate information 

Besides privacy attacks against digitalized medical records 
and healthcare information systems, adversaries also seek 
to obtain valuable information through non-technical 
intrusions such as insider incidents or social engineering. 
Loosely defined, insider incidents often involve 
abuses such as inside personnel accidentally leaking or 
stealing information, using pirated software, or accessing 
questionable webpages. Social engineering relies on people's 
unawareness of the value of information and carelessness 
in protecting it, and has become one of the major attacks 
on user privacy. However, in most cases, information 
obtained from non-digital channels is not accurate, owing 
to the difficulty of accessing the information and to human 
limitations or errors. For example, in today's medical practice, 
many doctors record patients' medical information (e.g. 
symptoms, diagnoses, prescriptions, etc.) with an audio 
recorder, and hire external companies to convert the recordings 
into digital records. In the process, an adversary may 
steal a recording and learn the detailed medical conditions 
of a patient; however, he may learn inaccurate information 
about the patient's identity (e.g. he may not be able to get the 
correct spelling of the patient's name from the doctor's voice). 
One may assume that the inaccuracy of attackers' knowledge 
makes it difficult for them to compromise user 
identity or privacy. Unfortunately, such inaccuracy can 
be corrected by consulting external information 
sources, and the privacy risks caused by such attacks 
should no longer be ignored. 

Here is a simple but representative example: Dr. Bob 
treats Alice in the hospital, while Mallory eavesdrops on the 
conversation or peeks at the record. Mallory possesses the 
full prescription with an inaccurate version of Alice's last 
name (due to Dr. Bob's squiggly handwriting). Mallory 
does not know Alice, so he starts his attack by first looking 
in the phonebook for all "similar" names in the neighborhood. 
The question is: what is Mallory's chance of 
accurately recovering Alice's full name? 

k-approximate-anonymity 

To further articulate this problem, we define 
k-approximate-anonymity: 

[k-approximate-anonymity]: Given a dataset D and 
a distance function dist(r1, r2) that returns the distance 
between any two records in the dataset: if, for any record r, 
there exist k - 1 records r' such that dist(r, r') <= l, where l 
is a preset threshold, we conclude that D satisfies 
k-approximate-anonymity, or k-l-anonymity, with dist. 

In the above definition, when l = 0, it becomes the 
original k-anonymity. It basically says that when Mallory 
possesses approximate information on a target, he can- 
not distinguish the target from k - 1 other records in 
the database. 

To simulate the above scenario, we have designed an 
experiment to study the identifiability of real names in 
the presence of inaccurate information from the attack- 
ers. We first implement a crawler to download the public 
residential phone book. In a few days, it successfully collected 
24,399 records from the State College area, which covers 
approximately 64% of the population (according to 
2000 census data). Each record contains a phone number, 
first and last names, and a full residential address. In 
the experiments, we use the full name as the identifier, and use 
the Levenshtein distance (edit distance) [23] as the distance 
function. For different thresholds l, we show the 
population whose names are protected under k-l-anonymity 
in Figure 2. 
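
A minimal sketch of how such a measurement could be computed is given below. It is not the code used in the experiment: the function names and the toy name list are made up, and "there exist k - 1 records" is read here as "at least k - 1 other records", which is an interpretation of the definition.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def neighbours_within(names, l):
    """For each record, count the other records within distance l."""
    return [sum(1 for j, other in enumerate(names)
                if j != i and levenshtein(name, other) <= l)
            for i, name in enumerate(names)]


def satisfies_k_l_anonymity(names, k, l):
    """True if every record has at least k - 1 other records within l."""
    return all(count >= k - 1 for count in neighbours_within(names, l))


# Hypothetical toy "phone book"; the real experiment used 24,399 records.
names = ["john smith", "jon smith", "john smyth", "alexandria montgomery"]

counts_l1 = neighbours_within(names, 1)
counts_l2 = neighbours_within(names, 2)
```

A record whose count is 0 is uniquely identifiable at that threshold: an attacker holding a version of the name with up to l errors still has exactly one candidate. The pairwise comparison is quadratic in the number of records, which is adequate for a sketch but would need indexing at phone-book scale.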

From the figure, we can see that, with larger l, people are 
less identifiable by their names. However, overall, most 
(more than 70%) people are uniquely identifiable even 
when l = 2. It means that even though Mallory gets an inaccurate 
name of the target, he has a good chance to correct 
the mistake and limit the target to a small range with the 
help of digital phonebooks. Even when Mallory gets four 
letters wrong in the name, in more than 80% of the cases, 
his target is limited to no more than 5 candidates, i.e., he 
only needs to further examine no more than 5 records to 
identify the target. As we expected, people with longer 
names or unusual names are more vulnerable, while peo- 
ple with shorter or more popular names are less identifi- 
able, especially when the attacker possesses inaccurate 
information. 

Conclusions 

In this work, we study the privacy vulnerabilities when 
medical records join with the Web. First, we show that 
multiple information sources (e.g. social networks and 
public records) could be utilized by attackers. With 
attribution, inference and aggregation attacks, the 
attackers are capable of reconstructing very comprehensive 
user profiles, with various types of highly sensitive 
and private information (e.g. names, phone numbers, 
birth dates, diseases, lab test results, etc). On the other 
hand, we show that people are very identifiable if the 
attackers are equipped with information retrieval and 
data mining techniques. Even though an attacker only 
possesses a piece of inaccurate information, he is still 
highly likely to identify the target with the help of exter- 
nal information sources. 
Figure 2 Population under k-l-anonymity. This example measures the distinguishability of the population under k-l-anonymity, using the first 
name as the identifier. The histogram plots the number of individuals whose first names differ from (k-1) other records by l letters. It shows that, 
within 24,399 records, most of them (more than 70%, as shown in the first bar with k=1) are quite distinguishable. 